cluster quality
- North America > Canada (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States (0.14)
Identifying bias in cluster quality metrics
Renedo-Mirambell, Martí, Arratia, Argimiro
We study potential biases of popular cluster quality metrics, such as conductance or modularity. We propose a method that uses both stochastic and preferential attachment block model constructions to generate networks with preset community structures, to which the quality metrics are then applied. These models also allow us to generate multi-level structures of varying strength, which show whether metrics favour partitions into a larger or smaller number of clusters. Additionally, we propose another quality metric, the density ratio. We observed that most of the studied metrics tend to favour partitions into a smaller number of big clusters, even when their relative internal and external connectivity are the same. The metrics found to be less biased are modularity and density ratio.
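For orientation, the two metrics named in the abstract can be sketched in a few lines. This is a minimal illustration using the common textbook definitions of conductance and Newman-Girvan modularity for an undirected, unweighted graph; it is an assumption that the paper uses these exact variants, and the toy graph is invented for the example. The density ratio is not shown because the abstract does not define it.

```python
from collections import defaultdict

def conductance(edges, cluster):
    """cut(S, rest) / min(vol(S), vol(rest)); assumes both sides have edges."""
    cluster = set(cluster)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        u_in, v_in = u in cluster, v in cluster
        if u_in != v_in:
            cut += 1  # edge crosses the cluster boundary
        # every edge endpoint adds one to the volume of the side it lies on
        vol_in += u_in + v_in
        vol_out += (not u_in) + (not v_in)
    return cut / min(vol_in, vol_out)

def modularity(edges, partition):
    """Sum over clusters c of L_c / m - (d_c / (2m))^2."""
    label = {node: i for i, nodes in enumerate(partition) for node in nodes}
    m = len(edges)
    internal = defaultdict(int)  # L_c: edges with both endpoints in cluster c
    degree = defaultdict(int)    # d_c: total degree of the nodes in cluster c
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = [{0, 1, 2}, {3, 4, 5}]
one_cluster = [{0, 1, 2, 3, 4, 5}]

print(conductance(edges, {0, 1, 2}))    # 1/7
print(modularity(edges, two_clusters))  # 5/14
print(modularity(edges, one_cluster))   # 0.0
```

Evaluating both functions on partitions of the same generated network at different granularities is one way to probe the kind of bias the abstract describes.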
- North America (0.14)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Asia > Middle East > Jordan (0.04)
A Computational Approach to Improving Fairness in K-means Clustering
Zhou, Guancheng, Xu, Haiping, Xu, Hongkang, Li, Chenyu, Yan, Donghui
Clustering is an important problem in data mining. It aims to split the data into groups such that data points in the same group are similar while points in different groups are dissimilar under a given similarity metric. Clustering has been successfully applied in many practical settings, such as data grouping in exploratory data analysis, search result categorization, and market segmentation. Clustering results are often used for further analysis or interpretation. However, directly applying results obtained from standard clustering algorithms may raise fairness issues: some clusters may favor data points from one of the subpopulations, i.e., contain disproportionately many of its points.
Figure 1: Illustration of the fairness issue in clustering. Points of different colors indicate different values of a sensitive variable, e.g., gender, where blue indicates male and red female. Cluster 1 is dominated by females, while Cluster 2 is dominated by males. Points with an arrow indicate that we might switch their cluster membership to make the clusters less dominated by one subpopulation.
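The dominance described in the figure caption can be made concrete with a small check. The sketch below computes each group's share within every cluster and the largest gap from its overall share; the `max_deviation` measure and the toy data are assumptions made for illustration, not the fairness criterion or the algorithm used in the paper.

```python
from collections import Counter

def cluster_proportions(labels, groups):
    """Return {cluster: {group: share of that cluster's points in the group}}."""
    counts = {}
    for cluster, group in zip(labels, groups):
        counts.setdefault(cluster, Counter())[group] += 1
    return {
        cluster: {g: n / sum(cnt.values()) for g, n in cnt.items()}
        for cluster, cnt in counts.items()
    }

def max_deviation(labels, groups):
    """Largest gap between a group's share in any cluster and its overall share."""
    overall = Counter(groups)
    total = len(groups)
    per_cluster = cluster_proportions(labels, groups)
    return max(
        abs(shares.get(g, 0.0) - overall[g] / total)
        for shares in per_cluster.values()
        for g in overall
    )

# Hypothetical data: cluster 1 is mostly "F", cluster 2 is mostly "M".
groups = ["F", "F", "F", "M", "M", "M", "M", "F"]
before = [1, 1, 1, 1, 2, 2, 2, 2]
# Switch the membership of one "F" point and one "M" point.
after = [2, 1, 1, 1, 1, 2, 2, 2]

print(max_deviation(before, groups))  # 0.25
print(max_deviation(after, groups))   # 0.0
```

Switching two points, as the arrows in the figure suggest, brings both clusters to the overall 50/50 composition in this toy case. A real method would also have to weigh the loss in clustering quality caused by each switch.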
- North America > United States > Massachusetts > Bristol County > Dartmouth (0.14)
- Asia > Middle East > Jordan (0.05)
Novel Topological Machine Learning Methodology for Stream-of-Quality Modeling in Smart Manufacturing
Lee, Jay, Ji, Dai-Yan, Hsu, Yuan-Ming
This paper presents a topological analytics approach within the 5-level Cyber-Physical Systems (CPS) architecture for the Stream-of-Quality assessment in smart manufacturing. The proposed methodology not only enables real-time quality monitoring and predictive analytics but also discovers the hidden relationships between quality features and process parameters across different manufacturing processes. A case study in additive manufacturing was used to demonstrate the feasibility of the proposed methodology to maintain high product quality and adapt to product quality variations. This paper demonstrates how topological graph visualization can be effectively used for the real-time identification of new representative data through the Stream-of-Quality assessment.
Enhancing Cluster Quality of Numerical Datasets with Domain Ontology
Heiyanthuduwage, Sudath Rohitha, Rahman, Md Anisur, Islam, Md Zahidul
Ontology-based clustering has gained attention in recent years due to the potential benefits of ontology. Current ontology-based clustering approaches have mainly been applied to reduce the dimensionality of attributes in text document clustering. Reducing the dimensionality of attributes using ontology helps to produce high-quality clusters for a dataset. However, ontology-based approaches to clustering numerical datasets have not gained enough attention. Moreover, some literature mentions that ontology-based clustering can produce either high-quality or low-quality clusters from a dataset. Therefore, in this paper we present a clustering approach that uses domain ontology to reduce the dimensionality of attributes in a numerical dataset and to produce high-quality clusters. For every dataset, we produce three datasets using domain ontology. We then cluster these datasets using a genetic algorithm-based clustering technique called GenClust++. The clusters of each dataset are evaluated in terms of Sum of Squared Error (SSE). We use six numerical datasets to evaluate the performance of our ontology-based approach. The experimental results of our approach indicate that cluster quality gradually improves from the lower to the higher levels of a domain ontology.
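As a reminder of what the evaluation measure computes, here is a minimal sketch of SSE under the usual definition: the sum of squared Euclidean distances from each point to the centroid of its cluster. The function name and the toy points are assumptions for illustration; GenClust++ and the ontology-based attribute reduction are not reproduced.

```python
def sse(points, labels):
    """Sum of squared distances from every point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # centroid = per-attribute mean of the cluster members
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

# Hypothetical toy data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

print(sse(points, [0, 0, 1, 1]))  # 4.0   (pairs grouped correctly)
print(sse(points, [0, 1, 0, 1]))  # 100.0 (pairs split across clusters)
```

Lower SSE indicates more compact clusters. Because SSE depends on the number and scale of attributes, comparisons between datasets with different dimensionality should be read with that in mind.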
- Oceania > Australia (0.05)
- Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)
- Asia > Middle East > Republic of Türkiye (0.04)
- North America > United States > California > Orange County > Irvine (0.04)
Text Mining Through Label Induction Grouping Algorithm Based Method
Saleem, Gulshan, Ahmed, Nisar, Qamar, Usman
The main focus of information retrieval methods is to provide accurate, efficient, and cost-effective results. LINGO (Label Induction Grouping Algorithm) is a clustering algorithm that aims to present search results as quality clusters, but it has a few limitations. In this paper, we focus on achieving more meaningful results and improving the overall performance of the algorithm. LINGO works in two main steps: cluster label induction using the Latent Semantic Indexing (LSI) technique, and cluster content discovery using the Vector Space Model (VSM). Since LINGO uses VSM for cluster content discovery, our task is to replace VSM with LSI for cluster content discovery and to analyze the feasibility of using LSI with Okapi BM25. The next task is to compare the results of the modified method with those of the original LINGO method. The research is applied to five different text-based data sets to obtain more reliable results for each method. The results show that LINGO produces 40-50% better results when using LSI for content discovery. Theoretical evidence suggests that using Okapi BM25 as the scoring method in LSI (LSI+Okapi BM25) for cluster content discovery, instead of VSM, also results in better cluster generation in terms of scalability and performance when compared to both the VSM and LSI results.
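For readers unfamiliar with Okapi BM25, the scoring function itself can be sketched as follows. This is a generic BM25 with whitespace tokenization, the commonly used defaults `k1=1.2` and `b=0.75`, and a smoothed IDF variant; all of these are assumptions for illustration. How the paper combines BM25 with LSI inside LINGO is not described in the abstract and is not reproduced here.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            n_t = doc_freq[term]
            # smoothed IDF (assumed variant), always positive
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # term-frequency saturation with document-length normalisation
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical mini-corpus of search snippets.
docs = [
    "graph clustering with spectral methods",
    "search results clustering labels",
    "cooking pasta at home",
]
scores = bm25_scores("clustering labels", docs)
print(scores)
```

In this toy corpus the second snippet matches both query terms and ranks highest, while the third matches none and scores zero.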
- Asia > Pakistan > Punjab > Lahore Division > Lahore (0.06)
- North America > United States > Hawaii (0.04)
- Asia > Taiwan (0.04)
- Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.04)
Bipartite Stochastic Block Models with Tiny Clusters
We study the problem of finding clusters in random bipartite graphs. We present a simple two-step algorithm which provably finds even tiny clusters of size $O(n^\epsilon)$, where $n$ is the number of vertices in the graph and $\epsilon > 0$. Previous algorithms were only able to identify clusters of size $\Omega(\sqrt{n})$. We evaluate the algorithm on synthetic and on real-world data; the experiments show that the algorithm can find extremely small clusters even in presence of high destructive noise.
- Europe > Austria > Vienna (0.14)
- North America > United States (0.14)
- North America > Canada > Quebec > Montreal (0.04)
Document Clustering Evaluation: Divergence from a Random Baseline
De Vries, Christopher M., Geva, Shlomo, Trotman, Andrew
Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures that cluster quality measures are performing work, which prevents ineffective clusterings that provide no useful result from receiving high scores. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters-to-categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure, but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. This approach can be applied to any clustering evaluation. This paper describes its use in the context of document clustering evaluation.
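The idea can be illustrated with a short sketch: score the clustering, score randomly shuffled versions of it, and report the difference. It is an assumption here that the baseline is built by shuffling items while keeping the cluster sizes fixed; purity is used as the example quality measure, and the function names and toy data are invented for the illustration.

```python
import random
from collections import Counter

def purity(labels, categories):
    """Fraction of items that belong to the majority category of their cluster."""
    by_cluster = {}
    for label, category in zip(labels, categories):
        by_cluster.setdefault(label, Counter())[category] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)

def divergence_from_random(labels, categories, measure=purity, trials=200, seed=0):
    """Score of the clustering minus the mean score of size-preserving shuffles."""
    rng = random.Random(seed)
    shuffled = list(labels)
    baseline = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)  # same cluster sizes, random membership
        baseline += measure(shuffled, categories)
    return measure(labels, categories) - baseline / trials

# Hypothetical ground truth: two categories of six documents each.
categories = ["a"] * 6 + ["b"] * 6
good = [0] * 6 + [1] * 6      # recovers the categories
singletons = list(range(12))  # one document per cluster

print(purity(singletons, categories))                  # 1.0
print(divergence_from_random(singletons, categories))  # 0.0
print(divergence_from_random(good, categories))        # clearly positive
```

The singleton clustering reaches perfect purity without doing anything useful, and its divergence from the random baseline is zero. The clustering that actually recovers the categories keeps a clearly positive score.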
- Oceania > New Zealand > South Island > Otago > Dunedin (0.04)
- Oceania > Australia > Queensland > Brisbane (0.04)
- North America > United States (0.04)